From bb907155585ca3bf794cf4d41171ad810c207304 Mon Sep 17 00:00:00 2001
From: "akw27@labyrinth.cl.cam.ac.uk"
Date: Fri, 27 Aug 2004 14:59:26 +0000
Subject: [PATCH] bitkeeper revision 1.1159.58.1 (412f4c4egZceX9qbmExr-wa_i_VDWw)

Notes on the innerworkings of the blkif drivers.
---
 .rootkeys                        |   1 +
 docs/blkif-drivers-explained.txt | 477 +++++++++++++++++++++++++++++++
 2 files changed, 478 insertions(+)
 create mode 100644 docs/blkif-drivers-explained.txt

diff --git a/.rootkeys b/.rootkeys
index bbf5c23855..3498de4277 100644
--- a/.rootkeys
+++ b/.rootkeys
@@ -12,6 +12,7 @@
 4021053fmeFrEyPHcT8JFiDpLNgtHQ docs/HOWTOs/Xen-HOWTO
 4022a73cgxX1ryj1HgS-IwwB6NUi2A docs/HOWTOs/XenDebugger-HOWTO
 3f9e7d53iC47UnlfORp9iC1vai6kWw docs/Makefile
+412f4bd9sm5mCQ8BkrgKcAKZGadq7Q docs/blkif-drivers-explained.txt
 3f9e7d60PWZJeVh5xdnk0nLUdxlqEA docs/eps/xenlogo.eps
 3f9e7d63lTwQbp2fnx7yY93epWS-eQ docs/figs/dummy
 3f9e7d564bWFB-Czjv1qdmE6o0GqNg docs/interface.tex
diff --git a/docs/blkif-drivers-explained.txt b/docs/blkif-drivers-explained.txt
new file mode 100644
index 0000000000..8f6f7a498a
--- /dev/null
+++ b/docs/blkif-drivers-explained.txt
@@ -0,0 +1,477 @@
+=== How the Blkif Drivers Work ===
+Andrew Warfield
+andrew.warfield@cl.cam.ac.uk
+
+The intent of this is to explain at a fairly detailed level how the
+split device drivers work in Xen 1.3 (aka 2.0beta). The intended
+audience for this, I suppose, is anyone who intends to work with the
+existing blkif interfaces and wants something to help them get up to
+speed with the code in a hurry. Secondly though, I hope to break out
+the general mechanisms that are used in the drivers that are likely to
+be necessary to implement other driver interfaces.
+
+As a point of warning before starting, it is worth mentioning that I
+anticipate many of the specifics described here changing in the near
+future. There has been talk about making the blkif protocol
+a bit more efficient than it currently is.
+Keir's addition of grant
+tables will change the current remapping code that is used when shared
+pages are initially set up.
+
+Also, writing other control interface types will likely need support
+from Xend, which at the moment has a steep learning curve... this
+should be addressed in the future.
+
+For more information on the driver model as a whole, read the
+"Reconstructing I/O" technical report
+(http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-xenngio.pdf).
+
+==== High-level structure of a split-driver interface ====
+
+Why would you want to write a split driver in the first place? As Xen
+is a virtual machine manager and focuses on isolation as an initial
+design principle, it is generally considered unwise to share physical
+access to devices across domains. The reasons for this are obvious:
+when device resources are shared, misbehaving code or hardware can
+result in the failure of all of the client applications. Moreover, as
+virtual machines in Xen are entire OSs, standard device drivers that
+they might use cannot have multiple instantiations for a single piece
+of hardware. In light of all this, the general approach in Xen is to
+give a single virtual machine hardware access to a device, and where
+other VMs want to share the device, export a higher-level interface to
+facilitate that sharing. If you don't want to share, that's fine.
+There are currently Xen users actively exploring running two
+completely isolated X-Servers on a Xen host, each with its own video
+card, keyboard, and mouse. In these situations, the guests need only
+be given physical access to the necessary devices and left to go on
+their own. However, for devices such as disks and network interfaces,
+where sharing is required, the split driver approach is a good
+solution.
+
+The structure is like this:
+
+    +--------------------------+  +--------------------------+
+    | Domain 0 (privileged)    |  | Domain 1 (unprivileged)  |
+    |                          |  |                          |
+    | Xend ( Application )     |  |                          |
+    | Blkif Backend Driver     |  | Blkif Frontend Driver    |
+    | Physical Device Driver   |  |                          |
+    +--------------------------+  +--------------------------+
+    +--------------------------------------------------------+
+    |                       X E N                            |
+    +--------------------------------------------------------+
+
+
+The Blkif driver is in two parts, which we refer to as a frontend (FE)
+and a backend (BE). Together, they serve to proxy device requests
+between the guest operating system in an unprivileged domain, and the
+physical device driver in the privileged domain. An additional benefit
+of this approach is that the FE driver can provide a single interface
+for a whole class of physical devices. The blkif interface mounts
+IDE, SCSI, and our own VBD-structured disks, independent of the
+physical driver underneath. Moreover, supporting additional OSs only
+requires that a new FE driver be written to connect to the existing
+backend.
+
+==== Inter-Domain Communication Mechanisms ====
+
+===== Event Channels =====
+
+Before getting into the specifics of the block interface driver, it is
+worth discussing the mechanisms that are used to communicate between
+domains. Two mechanisms are used to allow the construction of
+high-performance drivers: event channels and shared-memory rings.
+
+Event channels are an asynchronous interdomain notification
+mechanism. Xen allows channels to be instantiated between two
+domains, and domains can request that a virtual irq be attached to
+notifications on a given channel. The result of this is that the
+frontend domain can send a notification on an event channel, resulting
+in an interrupt entry into the backend at a later time.
+
+The event channel between two domains is instantiated in the Xend code
+during driver startup (described later).
+Xend's channel.py
+(tools/python/xen/xend/server/channel.py) defines the function
+
+
+def eventChannel(dom1, dom2):
+    return xc.evtchn_bind_interdomain(dom1=dom1, dom2=dom2)
+
+
+which maps to xc_evtchn_bind_interdomain() in tools/libxc/xc_evtchn.c,
+which in turn generates a hypercall to Xen to patch the event channel
+between the domains. Only a privileged domain can request the
+creation of an event channel.
+
+Once the event channel is created in Xend, its ends are passed to both the
+front and backend domains over the control channel. The end that is
+passed to a domain is just an integer "port" uniquely identifying the
+event channel's local connection to that domain. An example of this
+setup code is in linux-2.6.x/drivers/xen/blkfront/blkfront.c in
+blkif_status_change(), which receives several status change events as
+the driver starts up. It is passed an event channel end in a
+BLKIF_INTERFACE_STATUS_CONNECTED message, and patches it in like this:
+
+
+    blkif_evtchn = status->evtchn;
+    blkif_irq = bind_evtchn_to_irq(blkif_evtchn);
+    if ( (rc = request_irq(blkif_irq, blkif_int,
+                          SA_SAMPLE_RANDOM, "blkif", NULL)) )
+        printk(KERN_ALERT"blkfront request_irq failed (%ld)\n",rc);
+
+
+This code associates a virtual irq with the event channel, and
+attaches the function blkif_int() as an interrupt handler for that
+irq. blkif_int() simply handles the notification and returns; it does
+not need to interact with the channel at all.
+
+An example of generating a notification can also be seen in blkfront.c:
+
+
+static inline void flush_requests(void)
+{
+    DISABLE_SCATTERGATHER();
+    wmb(); /* Ensure that the backend can see the requests. */
+    blk_ring->req_prod = req_prod;
+    notify_via_evtchn(blkif_evtchn);
+}
+
+
+notify_via_evtchn() issues a hypercall to set the event waiting flag on
+the other domain's end of the channel.
+
+===== Communication Rings =====
+
+Event channels are strictly a notification mechanism between domains.
+To move large chunks of data back and forth, Xen allows domains to
+share pages of memory. We use communication rings as a means of
+managing access to a shared memory page for message passing between
+domains. These rings are not explicitly a mechanism of Xen, which is
+only concerned with the actual sharing of the page and not how it is
+used; they are, however, worth discussing as they are used in many
+places in the current code and are a useful model for communicating
+across a shared page.
+
+A shared page is set up by a guest first allocating and passing the
+address of a page in its own address space to the backend driver.
+
+
+    blk_ring = (blkif_ring_t *)__get_free_page(GFP_KERNEL);
+    blk_ring->req_prod = blk_ring->resp_prod = resp_cons = req_prod = 0;
+    ...
+    /* Construct an interface-CONNECT message for the domain controller. */
+    cmsg.type      = CMSG_BLKIF_FE;
+    cmsg.subtype   = CMSG_BLKIF_FE_INTERFACE_CONNECT;
+    cmsg.length    = sizeof(blkif_fe_interface_connect_t);
+    up.handle      = 0;
+    up.shmem_frame = virt_to_machine(blk_ring) >> PAGE_SHIFT;
+    memcpy(cmsg.msg, &up, sizeof(up));
+
+
+blk_ring will be the shared page. The producer and consumer pointers
+are then initialised (these will be discussed soon), and then the
+machine address of the page is sent to the backend via a control
+channel to Xend. This control channel itself uses the notification
+and shared memory mechanisms described here, but is set up for each
+domain automatically at startup.
+
+The backend, which runs in a privileged domain, then takes the page
+address and maps it into its own address space (in
+linux26/drivers/xen/blkback/interface.c:blkif_connect()):
+
+
+void blkif_connect(blkif_be_connect_t *connect)
+
+    ...
+    unsigned long shmem_frame = connect->shmem_frame;
+    ...
+
+    if ( (vma = get_vm_area(PAGE_SIZE, VM_IOREMAP)) == NULL )
+    {
+        connect->status = BLKIF_BE_STATUS_OUT_OF_MEMORY;
+        return;
+    }
+
+    prot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED);
+    error = direct_remap_area_pages(&init_mm, VMALLOC_VMADDR(vma->addr),
+                                    shmem_frame<<PAGE_SHIFT, ...);
+    ...
+    blkif->blk_ring_base = (blkif_ring_t *)vma->addr;
+
+
+The machine frame number of the page is passed in the shmem_frame field
+of the connect message. This is then mapped into the virtual address
+space of the backend domain, and saved in the blkif structure
+representing this particular backend connection.
+
+NOTE: New mechanisms will be added very shortly to allow domains to
+explicitly grant access to their pages to other domains. This "grant
+table" support is in the process of being added to the tree, and will
+change the way a shared page is set up. In particular, it will remove
+the need for the remapping domain to be privileged.
+
+Sending data across shared rings:
+
+Shared rings avoid the potential for write interference between
+domains in a very cunning way. A ring is partitioned into a request
+and a response region, and domains only work within their own space.
+This can be thought of as a double producer-consumer ring -- the ring
+is described by four pointers into a circular buffer of fixed-size
+records. Pointers may only advance, and may not pass one another.
+
+
+                         rsp_cons----+
+                                     V
+          +----+----+----+----+----+----+----+
+          |    |    |     free     |RSP1|RSP2|
+          +----+----+----+----+----+----+----+
+req_prod->|    |       -------->        |RSP3|
+          +----+                        +----+
+          |REQ8|                        |    |<-rsp_prod
+          +----+                        +----+
+          |REQ7|                        |    |
+          +----+                        +----+
+          |REQ6|       <--------        |    |
+          +----+----+----+----+----+----+----+
+          |REQ5|REQ4|     free     |    |    |
+          +----+----+----+----+----+----+----+
+req_cons---------^
+
+
+
+By adopting the convention that every request will receive a response,
+not all four pointers need be shared and flow control on the ring
+becomes very easy to manage.
+Each domain manages its own
+consumer pointer, and the two producer pointers are visible to both (Xen/include/hypervisor-ifs/io/blkif.h):
+
+
+
+/* NB. Ring size must be small enough for sizeof(blkif_ring_t) <= PAGE_SIZE. */
+#define BLKIF_RING_SIZE        64
+
+  ...
+
+/*
+ * We use a special capitalised type name because it is _essential_ that all
+ * arithmetic on indexes is done on an integer type of the correct size.
+ */
+typedef u32 BLKIF_RING_IDX;
+
+/*
+ * Ring indexes are 'free running'. That is, they are not stored modulo the
+ * size of the ring buffer. The following macro converts a free-running counter
+ * into a value that can directly index a ring-buffer array.
+ */
+#define MASK_BLKIF_IDX(_i) ((_i)&(BLKIF_RING_SIZE-1))
+
+typedef struct {
+    BLKIF_RING_IDX req_prod;  /*  0: Request producer. Updated by front-end. */
+    BLKIF_RING_IDX resp_prod; /*  4: Response producer. Updated by back-end. */
+    union {                   /*  8 */
+        blkif_request_t  req;
+        blkif_response_t resp;
+    } PACKED ring[BLKIF_RING_SIZE];
+} PACKED blkif_ring_t;
+
+
+
+As shown in the diagram above, the rules for using a shared memory
+ring are simple.
+
+ 1. A ring is full when a domain's producer pointer has caught up with
+    its consumer pointer, i.e. they index the same slot after masking
+    (with free-running indexes, e.g. req_prod - resp_cons ==
+    BLKIF_RING_SIZE). In this situation, the consumer pointer must be
+    advanced. Furthermore, if the consumer pointer is equal to the
+    other domain's producer pointer (e.g. resp_cons == resp_prod),
+    then the other domain has all the buffers.
+
+ 2. Producer pointers point to the next buffer that will be written to.
+    (So blk_ring->ring[MASK_BLKIF_IDX(req_prod)] should not be consumed.)
+
+ 3. Consumer pointers point to a valid message, so long as they are not
+    equal to the associated producer pointer.
+
+ 4. A domain should only ever write to the message pointed
+    to by its producer index, and read from the message at its
+    consumer.
+    More generally, the domain may be thought of as having
+    exclusive access to the messages between its consumer and producer,
+    and should absolutely not read or write outside this region.
+
+In general, drivers keep a private copy of their producer pointer and
+then set the shared version when they are ready for the other end to
+process a set of messages. Additionally, it is worth paying attention
+to the use of memory barriers (rmb/wmb) in the code, to ensure that
+rings that are shared across processors behave as expected.
+
+==== Structure of the Blkif Drivers ====
+
+Now that the communications primitives have been discussed, I'll
+quickly cover the general structure of the blkif driver. This is
+intended to give a high-level idea of what is going on, in an effort
+to make reading the code a more approachable task.
+
+There are three key software components that are involved in the blkif
+drivers (not counting Xen itself): the frontend and backend drivers,
+and Xend, which coordinates their initial connection. Xend may also
+be involved in control-channel signalling in some cases after startup,
+for instance to manage reconnection if the backend is restarted.
+
+===== Frontend Driver Structure =====
+
+The frontend domain uses a single event channel and a shared memory
+ring to trade control messages with the backend. These are both set up
+during domain startup, which will be discussed shortly. The shared
+memory ring is called blk_ring, and the private ring indexes are
+resp_cons and req_prod. The ring is protected by blkif_io_lock.
+Additionally, the frontend keeps a list of outstanding requests in
+rec_ring[]. These are uniquely identified by a guest-local id number,
+which is associated with each request sent to the backend, and
+returned with the matching responses. Information about the actual
+disks is stored in major_info[], of which only the first nr_vbds
+entries are valid.
+Finally, the global 'recovery' indicates that the
+connection between the backend and frontend drivers has been broken
+(possibly due to a backend driver crash) and that the frontend is in
+recovery mode, in which case it will attempt to reconnect and reissue
+outstanding requests.
+
+The frontend driver is single-threaded and after setup is entered only
+through three points: (1) read/write requests from the XenLinux guest
+that it is a part of, (2) interrupts from the backend driver on its
+event channel (blkif_int()), and (3) control messages from Xend
+(blkif_ctrlif_rx()).
+
+===== Backend Driver Structure =====
+
+The backend driver is slightly more complex as it must manage any
+number of concurrent frontend connections. For each domain it
+manages, the backend driver maintains a blkif structure, which
+describes all the connection and disk information associated with that
+particular domain. This structure is associated with the interrupt
+registration, and allows the backend driver to have immediate context
+when it takes a notification from some domain.
+
+All of the blkif structures are stored in a hash table (blkif_hash),
+which is indexed by a hash of the domain id, and a "handle", really a
+per-domain blkif identifier, in case a domain wants to have multiple
+connections.
+
+The per-connection blkif structure is of type blkif_t. It contains
+all of the communication details (event channel, irq, shared memory
+ring and indexes), and blk_ring_lock, which is the backend mutex on
+the shared ring. The structure also contains vbd_rb, which is a
+red-black tree, containing an entry for each device/partition that is
+assigned to that domain. This structure is filled by xend passing
+disk information to the backend at startup, and is protected by
+vbd_lock. Finally, the blkif struct contains a status field, which
+describes the state of the connection.
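
To make the (domain id, handle) lookup a little more concrete, here is a minimal sketch in C of a chained hash table keyed on that pair. It is only an illustration under stated assumptions, not the backend's actual code: every name prefixed with sketch_ or SKETCH_ is hypothetical, the XOR-and-mask hash and the table size are guesses at a reasonable scheme, and the real blkif_t carries far more state (event channel, irq, ring, vbd_rb, locks) than the three fields shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a power-of-two table size so the hash can be a simple mask. */
#define SKETCH_HASHSZ 1024u

/* Hypothetical, cut-down stand-in for the per-connection blkif_t. */
typedef struct sketch_blkif {
    uint32_t             domid;     /* frontend domain id                 */
    uint32_t             handle;    /* per-domain blkif identifier        */
    int                  status;    /* connection state (unused here)     */
    struct sketch_blkif *hash_next; /* next record in the same bucket     */
} sketch_blkif_t;

static sketch_blkif_t *sketch_hash[SKETCH_HASHSZ];

/* Assumed hash: mix the two key fields and mask down to a bucket index. */
static unsigned int sketch_hash_index(uint32_t domid, uint32_t handle)
{
    return (domid ^ handle) & (SKETCH_HASHSZ - 1u);
}

/* Walk one bucket's chain; returns NULL if no record has this exact key. */
sketch_blkif_t *sketch_blkif_find(uint32_t domid, uint32_t handle)
{
    sketch_blkif_t *b = sketch_hash[sketch_hash_index(domid, handle)];

    while (b != NULL && (b->domid != domid || b->handle != handle))
        b = b->hash_next;
    return b;
}

/* Link a record at the head of its bucket; -1 if the key already exists. */
int sketch_blkif_insert(sketch_blkif_t *blkif)
{
    sketch_blkif_t **bucket;

    assert(blkif != NULL);
    if (sketch_blkif_find(blkif->domid, blkif->handle) != NULL)
        return -1;

    bucket = &sketch_hash[sketch_hash_index(blkif->domid, blkif->handle)];
    blkif->hash_next = *bucket;
    *bucket = blkif;
    return 0;
}
```

With this hash, the keys (1, 0) and (0, 1) land in the same bucket, so the full-key comparison in sketch_blkif_find() is what tells two connections apart. A real driver would also need locking around the table, which is left out here.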
+
+The backend driver spawns a kernel thread at startup
+(blkio_schedule()), which handles requests to and from the actual disk
+device drivers. This scheduler thread maintains a list of blkif
+structures that have pending requests, and services them round-robin
+with a maximum per-round request limit. blkifs are added to the list
+in the interrupt handler (blkif_be_int()) using
+add_to_blkdev_list_tail(), and removed in the scheduler loop after
+calling do_block_io_op(), which processes a batch of requests. The
+scheduler thread is explicitly activated at several points in the code
+using maybe_trigger_blkio_schedule().
+
+Pending requests between the backend driver and the physical device
+drivers use another ring, pending_ring. Requests are placed in this
+ring in the scheduler thread and issued to the device. A completion
+callback, end_block_io_op(), indicates that requests have been serviced
+and generates a response on the appropriate blkif ring.
+pending_reqs[] stores a list of outstanding requests with the physical
+drivers.
+
+So, control entries to the backend are (1) the blkio scheduler thread,
+which sends requests to the real device drivers, (2) end_block_io_op(),
+which is called as serviced requests complete, (3) blkif_be_int(),
+which handles notifications from the frontend drivers in other domains,
+and (4) blkif_ctrlif_rx(), which handles control messages from xend.
+
+==== Driver Startup ====
+
+Prior to starting a new guest using the frontend driver, the backend
+will have been started in a privileged domain. The backend
+initialisation code initialises all of its data structures, such as
+the blkif hash table, and starts the scheduler thread as a kernel
+thread. It then sends a driver status up message to let xend know it
+is ready to take frontend connections.
+
+When a new domain that uses the blkif frontend driver is started,
+there is a series of interactions between it, xend, and the specified
+backend driver.
+These interactions are as follows:
+
+The domain configuration given to xend will specify the backend domain
+and disks that the new guest is to use. Prior to actually running the
+domain, xend and the backend driver interact to set up the initial
+blkif record in the backend.
+
+(1) Xend sends a BLKIF_BE_CREATE message to the backend.
+
+    Backend does blkif_create(), having been passed FE domid and handle.
+    It creates and initialises a new blkif struct, and puts it in the
+    hash table.
+    It then returns a STATUS_OK response to xend.
+
+(2) Xend sends a BLKIF_BE_VBD_CREATE message to the backend.
+
+    Backend adds a vbd entry in the red-black tree for the
+    specified (dom, handle) blkif entry.
+    Sends a STATUS_OK response.
+
+(3) Xend sends a BLKIF_BE_VBD_GROW message to the backend.
+
+    Backend takes the physical device information passed in the
+    message and assigns it to the newly created vbd struct.
+
+(2) and (3) repeat as any additional devices are added to the domain.
+
+At this point, the backend has enough state to allow the frontend
+domain to start. The domain is run, and eventually gets to the
+frontend driver initialisation code. After setting up the frontend
+data structures, this code continues the communications with xend and
+the backend to negotiate a connection:
+
+(4) Frontend sends Xend a BLKIF_FE_DRIVER_STATUS_CHANGED message.
+
+    This message tells xend that the driver is up. The init function
+    now spin-waits until driver setup is complete in order to prevent
+    Linux from attempting to boot before the disks are connected.
+
+(5) Xend sends the frontend an INTERFACE_STATUS_CHANGED message.
+
+    This message specifies that the interface is now disconnected
+    (instead of closed).
+    The domain updates its state, and allocates the shared blk_ring
+    page.
+
+(6) Frontend sends Xend a BLKIF_INTERFACE_CONNECT message.
+
+    This message specifies the domain and handle, and includes the
+    address of the newly created page.
+
+(7) Xend sends the backend a BLKIF_BE_CONNECT message.
+
+    The backend fills in the blkif connection information, maps the
+    shared page, and binds an irq to the event channel.
+
+(8) Xend sends the frontend an INTERFACE_STATUS_CHANGED message.
+
+    This message takes the frontend driver to a CONNECTED state, at
+    which point it binds an irq to the event channel and calls
+    xlvbd_init() to initialise the individual block devices.
+
+The frontend Linux is still spin-waiting at this point, until all of
+the disks have been probed. Messaging now is directly between the
+front and backend domain using the new shared ring and event channel.
+
+(9) The frontend sends a BLKIF_OP_PROBE directly to the backend.
+
+    This message includes a reference to an additional page that the
+    backend can use for its reply. The backend responds with an array
+    of the domain's disks (as vdisk_t structs) on the provided page.
+
+The frontend now initialises each disk, calling xlvbd_init_device()
+for each one.
-- 
2.30.2